ABSTRACT

Many fundamental problems in natural language processing rely on
determining what entities appear in a given text. Commonly referred
to as entity linking, this step is a fundamental component of many
NLP tasks such as text understanding, automatic summarization,
semantic search or machine translation. Name ambiguity, word
polysemy, context dependencies and a heavy-tailed distribution of
entities contribute to the complexity of this problem.

We here propose a probabilistic approach that makes use of an
effective graphical model to perform collective entity
disambiguation. Input mentions (i.e., linkable token spans) are
disambiguated jointly across an entire document by combining a
document-level prior of entity co-occurrences with local information
captured from mentions and their surrounding context. The model is
based on simple sufficient statistics extracted from data, thus
relying on few parameters to be learned.

Our method does not require extensive feature engineering, nor an
expensive training procedure. We use loopy belief propagation to
perform approximate inference. The low complexity of our model makes
this step sufficiently fast for real-time usage. We demonstrate the
accuracy of our approach on a wide range of benchmark datasets,
showing that it matches, and in many cases outperforms, existing
state-of-the-art methods.

Keywords 

Entity linking; Entity disambiguation; Wikification; Prob¬

1. INTRODUCTION

Digital systems are producing increasing amounts of data every day.
With daily global volumes of several terabytes of new textual
content, there is a growing need for automatic methods for text
aggregation, summarization, and, eventually, semantic understanding.
Entity linking is a key step towards these goals as it reveals the
semantics of spans of text that refer to real-world entities. In
practice, this is achieved by establishing a mapping between
potentially ambiguous surface forms of entities and their canonical
representations such as corresponding Wikipedia¹ articles or
Freebase² entries. Figure 1 illustrates the difficulty of this task
when dealing with real-world data. The main challenges arise from
word ambiguities inherent to natural language: surface form
synonymy, i.e., different spans of text referring to the same
entity, and homonymy, i.e., the same name being shared by multiple
entities.



Figure 1: An entity disambiguation problem showcasing five given
mentions and their potential entity candidates.

We here describe and evaluate a novel light-weight and fast
alternative to heavy machine-learning approaches for document-level
entity disambiguation with Wikipedia. Our model is primarily based
on simple empirical statistics acquired from a training dataset and
relies on a very small number of learned parameters. This has
certain advantages, such as a very fast training procedure that can
be applied to massive amounts of data, as well as a better
understanding of the model compared to increasingly popular deep
learning architectures (e.g., He et al. [14]). As a prerequisite, we
assume that a given input set of mentions was already discovered via
a mention detection procedure³. Our starting point is the natural
assumption that each entity depends (i) on its mention, (ii) on its
neighboring local contextual words, and (iii) on other entities that
appear in the same document.

¹ http://en.Wikipedia.org/

² https://www.freebase.com/

In order to enforce these conditions, we rely on a conditional
probabilistic model that consists of two parts: (1) the likelihood
of a candidate entity given the referring token span and its
surrounding context, and (2) the prior joint distribution of the
candidate entities corresponding to all the mentions in a document.
Our model relies on the max-product algorithm to collectively infer
entities for all mentions in a given document.

We further illustrate these modeling decisions. In the example
depicted in Figure 1, each highlighted mention constrains the set of
possible entity candidates to a limited size set, yet leaves a
significant level of ambiguity. However, there is one collective way
of linking that is jointly consistent with all the chosen entities
and supported by contextual cues. Intuitively, the related entities
Thomas_Muller and Germany_national_football_team are likely to
appear in the same document, especially in the presence of
contextual words related to soccer, like “team” or “goal”.

Our main contributions are outlined below: (1) We employ rigorous
probabilistic semantics for the entity disambiguation problem by
introducing a principled probabilistic graphical model that requires
a simple and fast training procedure. (2) At the core of our joint
probabilistic model, we derive a minimal set of potential functions
that proficiently explain statistics of observed training data. (3)
Throughout a range of experiments performed on several standard
datasets using the Gerbil platform, we demonstrate competitive or
state-of-the-art quality compared to some of the best existing
approaches. (4) Moreover, our training procedure is solely based on
publicly available Wikipedia hyperlink statistics and the method
does not require extensive hyperparameter tuning, nor feature
engineering, making this paper a self-contained manual for
implementing an entity disambiguation system from scratch.

The remainder of this paper is structured as follows: Section 2
briefly discusses relevant entity linking literature. Section 3
formally introduces our probabilistic graphical model and details
the initialization and learning procedure of the model’s
parameters. Section 4 describes the inference process used for
collective entity resolution. Section 5 empirically demonstrates the
merits of the proposed method on multiple standard collections of
manually annotated documents. Finally, in Section 6, we conclude
with a summary of our findings and an overview of ongoing and future
work.

2. RELATED WORK 

There is a substantial body of existing work dedicated to 
the task of entity linking with Wikipedia (Wikification). We 
can identify four major paradigms of how this challenge is 
approached. 

Local models consider the individual context of each entity mention
in isolation in order to reduce the size of the decision space. In
one of the early entity linking papers, Mihalcea and Csomai propose
an entity disambiguation scheme based on similarity statistics
between the mention context and the entity’s Wikipedia page. Milne
and Witten further refine their scheme with special focus on the
mention detection step. Bunescu and Pasca present a
Wikipedia-driven approach, making use of manually created resources
such as redirect and disambiguation pages. Dredze et al. cast the
entity linking task as a retrieval problem, treating mentions and
their contexts as queries, and ranking candidate entities according
to their likelihood of being referred to.

³ For example, using a named-entity recognition system. However,
note that our approach is not restricted to named entities, but
targets any Wikipedia entity.


Global models attempt to jointly disambiguate all mentions in a
document based on the assumption that the underlying entities are
correlated and consistent with the main topic of the document.
While this approach tends to result in superior accuracy, the space
of possible entity assignments grows combinatorially. As a
consequence, many approaches in this group rely on approximate
inference mechanisms. Cucerzan uses high-dimensional vector space
representations of candidate entities and attempts to iteratively
choose candidates that optimize the mutual proximity to existing
candidates. Kulkarni et al. exploit topical information about
candidate entities and try to harmonize these topics across all
assigned entities. Ratinov et al. [27] prune the list of entity
mentions using support vector machines trained on a range of
similarity and term overlap features between entity
representations. Ferragina and Scaiella focus on short documents
such as tweets or search engine snippets. Based on evidence across
all mentions, the authors employ a voting scheme for entity
disambiguation. Cheng et al. and Singh et al. [31] describe models
for jointly capturing the interdependence between the tasks of
entity tagging, relation extraction and co-reference resolution.
Similarly, Durrett and Klein describe a graphical model for
collectively addressing the tasks of named entity recognition,
entity disambiguation and co-reference resolution.

Graph-based models establish relationships between candidate
entities and mentions using structural models. For inference,
various approaches are employed, ranging from densest graph
estimation algorithms (Hoffart et al.) to graph traversal methods
such as random graph walks (Guo and Barbosa, Han et al.). In a
similar fashion, these techniques can be combined to enhance the
quality of both entity linking and word sense disambiguation in a
synergistic solution (Moro et al. [23]).

The above approaches are limited because they assume a single topic
per document. Naturally, topic modelling can be used for entity
disambiguation by attempting to harmonize the individual
distribution of latent topics across candidate entities. Houlsby
and Ciaramita [16] and Pilz and Paaß rely on Latent Dirichlet
Allocation (LDA) and compare the resulting topic distribution of
the input document to the topic distributions of the disambiguated
entities’ Wikipedia pages. Han and Sun propose a joint model of
mention context compatibility and topic coherence, allowing them to
simultaneously draw from both local (terms, mentions) as well as
global (topic distributions) information. Kataria et al. use a
semi-supervised hierarchical LDA model based on a wide range of
features extracted from Wikipedia pages and topic hierarchies.

In contrast to previous work on this problem, our method exploits
co-occurrence statistics in a fully probabilistic manner using a
graph-based model that addresses collective entity disambiguation.
It combines a clean and light-weight probabilistic model with an
elegant, real-time inference algorithm. An advantage over
increasingly popular deep learning architectures for entity linking
(e.g., Sun et al., He et al. [14]) is the speed of our training
procedure, which relies on count statistics from data and learns
only very few parameters. State-of-the-art accuracy is achieved
without the need for special-purpose computational heuristics.


3. PROBABILISTIC MODEL 

In this section, we formally define the entity linking task that we
address in this work and describe our modeling approach in detail.

3.1 Problem Definition and Formulation 

Let $\mathcal{E}$ be a knowledge base (KB) of entities,
$\mathcal{V}$ a finite dictionary of phrases or names and
$\mathcal{C}$ a context representation. Formally, we seek a mapping
$F : (\mathcal{V}, \mathcal{C})^n \to \mathcal{E}^n$ that takes as
input a sequence of linkable mentions
$\mathbf{m} = (m_1, \dots, m_n)$ along with their contexts
$\mathbf{c} = (c_1, \dots, c_n)$ and produces a joint entity
assignment $\mathbf{e} = (e_1, \dots, e_n)$. Here $n$ refers to the
number of linkable spans in a document. Our problem is also known
as entity disambiguation or link generation in the literature.⁴

We can construct such a mapping $F$ in a probabilistic approach, by
learning a conditional probability model
$p(\mathbf{e} \mid \mathbf{m}, \mathbf{c})$ from data and then
employing (approximate) probabilistic inference in order to find
the maximum a posteriori (MAP) assignment, hence:

$$ F(\mathbf{m}, \mathbf{c}) := \arg\max_{\mathbf{e} \in \mathcal{E}^n} p(\mathbf{e} \mid \mathbf{m}, \mathbf{c}) . \tag{1} $$

In the sequel, we describe how to estimate such a model from a
corpus of entity-linked documents. Finally, we show in Section 4
how to apply belief propagation (max-product) for approximate
inference in this model.

3.2 Maximum Entropy Models 

Assume a corpus of entity-linked documents is available.
Specifically, we used the set of Wikipedia pages together with
their respective Wiki hyperlinks. These hyperlinks are considered
ground truth annotations, the mention being the linked span of text
and the true entity being the Wikipedia page it refers to. One can
extract two kinds of basic statistics from such a corpus: first,
counts of how often each entity was referred to by a specific name;
second, pairwise co-occurrence counts for entities in documents.
Our fundamental conjecture is that most of the relevant information
needed for entity disambiguation is contained in these counts,
i.e., that they are sufficient statistics. We thus require that our
probability model reproduces these counts in expectation. As this
alone typically yields an ill-defined problem, we follow the
maximum entropy principle of Jaynes: Among the feasible set of
distributions we favor the one with maximal entropy.

Formally, let $\mathcal{D}$ be an entity-linked document
collection. Ignoring mention contexts for now, we extract for each
document $d \in \mathcal{D}$ a sequence of mentions
$\mathbf{m}^{(d)}$ and their corresponding target entities
$\mathbf{e}^{(d)}$, both of length $n^{(d)}$. Assuming
exchangeability of random variables within these sequences, we
reduce each $(\mathbf{e}, \mathbf{m})$ to statistics (or features)
about mention-entity and entity-entity co-occurrence as follows:

⁴ Note that we do not address the issues of mention detection or nil
identification in this work. Rather, our input is a document along
with a fixed set of linkable mentions corresponding to existing KB
entities.

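$$ \phi_{e,m}(\mathbf{e}, \mathbf{m}) := \sum_{i=1}^{n} \mathbb{1}[e_i = e \wedge m_i = m], \quad \forall (e, m) \in \mathcal{E} \times \mathcal{V} \tag{2} $$

$$ \psi_{\{e,e'\}}(\mathbf{e}) := \sum_{i<j} \mathbb{1}[\{e_i, e_j\} = \{e, e'\}], \quad \forall e, e' \in \mathcal{E} , \tag{3} $$

where $\mathbb{1}[\cdot]$ is the indicator function. Note that we
use the subscript notation $\{e, e'\}$ for $\psi$ to take into
account the symmetry in $e, e'$ as well as the fact that one may
have $e = e'$.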
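The two statistics can be computed for a single document with a few lines of Python. This is a minimal sketch, not the authors' implementation; the function names and the toy mention strings are assumptions made for illustration. Unordered pairs are stored as `frozenset` keys, so the symmetric case and the case $e = e'$ are both handled.

```python
from collections import Counter
from itertools import combinations

def mention_entity_counts(entities, mentions):
    """phi_{e,m}: how often entity e is linked from mention m (Eq. 2)."""
    return Counter(zip(entities, mentions))

def entity_pair_counts(entities):
    """psi_{{e,e'}}: unordered co-occurrence counts over positions i < j (Eq. 3)."""
    return Counter(frozenset(pair) for pair in combinations(entities, 2))

# Toy document; the mention strings are made up for illustration.
entities = ["Germany_national_football_team", "Thomas_Muller",
            "Germany_national_football_team"]
mentions = ["Germany", "Muller", "German team"]

phi = mention_entity_counts(entities, mentions)
psi = entity_pair_counts(entities)
```

Averaging these counters over all documents of a collection gives the empirical expectations that the model is asked to reproduce.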

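The document collection provides us with empirical estimates for
the expectation of these statistics under an i.i.d. sampling model
for documents, namely the averages

$$ \phi_{e,m}(\mathcal{D}) := \frac{1}{|\mathcal{D}|} \sum_{d \in \mathcal{D}} \phi_{e,m}(\mathbf{e}^{(d)}, \mathbf{m}^{(d)}) , \tag{4} $$

$$ \psi_{\{e,e'\}}(\mathcal{D}) := \frac{1}{|\mathcal{D}|} \sum_{d \in \mathcal{D}} \psi_{\{e,e'\}}(\mathbf{e}^{(d)}) . \tag{5} $$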

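Note that in entity disambiguation, the mention sequence
$\mathbf{m}$ is always considered given, while we seek to predict
the corresponding entity sequence $\mathbf{e}$. It is thus not
necessary to try to model the joint distribution
$p(\mathbf{e}, \mathbf{m})$, but sufficient to construct a
conditional model $p(\mathbf{e} \mid \mathbf{m})$. Following Berger
et al., this can be accomplished by taking the empirical
distribution $p(\mathbf{m} \mid \mathcal{D})$ of mention sequences
and combining it with a conditional model via
$p(\mathbf{e}, \mathbf{m}) = p(\mathbf{e} \mid \mathbf{m}) \cdot p(\mathbf{m} \mid \mathcal{D})$.
We then require that:

$$ E_p[\phi_{e,m}] = \phi_{e,m}(\mathcal{D}) \quad \text{and} \quad E_p[\psi_{\{e,e'\}}] = \psi_{\{e,e'\}}(\mathcal{D}) , \tag{6} $$

which yields
$|\mathcal{E}| \cdot |\mathcal{V}| + \binom{|\mathcal{E}|}{2} + |\mathcal{E}|$
moment constraints on $p(\mathbf{e} \mid \mathbf{m})$.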

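The maximum entropy distributions fulfilling the constraints stated
in Eq. (6) form a conditional exponential family for which
$\phi(\cdot, \mathbf{m})$ and $\psi(\cdot)$ are sufficient
statistics. We thus know that there are canonical parameters
$\rho_{e,m}$ and $\lambda_{\{e,e'\}}$ (formally corresponding to
Lagrange multipliers) such that the maximum entropy distribution
can be written as

$$ p(\mathbf{e} \mid \mathbf{m}; \rho, \lambda) = \frac{1}{Z(\mathbf{m})} \exp\left[ \langle \rho, \phi(\mathbf{e}, \mathbf{m}) \rangle + \langle \lambda, \psi(\mathbf{e}) \rangle \right] \tag{7} $$

where $Z(\mathbf{m})$ is the partition function

$$ Z(\mathbf{m}) := \sum_{\mathbf{e} \in \mathcal{E}^n} \exp\left[ \langle \rho, \phi(\mathbf{e}, \mathbf{m}) \rangle + \langle \lambda, \psi(\mathbf{e}) \rangle \right] . \tag{8} $$

Here we interpret $(e, m)$ and $\{e, e'\}$ as multi-indices and
suggestively define the shorthands

$$ \langle \rho, \phi \rangle := \sum_{e,m} \rho_{e,m} \phi_{e,m} , \qquad \langle \lambda, \psi \rangle := \sum_{\{e,e'\}} \lambda_{\{e,e'\}} \psi_{\{e,e'\}} . \tag{9} $$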

Note that we can switch between the statistics view and the raw
data view by observing that

$$ \langle \rho, \phi(\mathbf{e}, \mathbf{m}) \rangle = \sum_{i=1}^{n} \rho_{e_i, m_i} , \qquad \langle \lambda, \psi(\mathbf{e}) \rangle = \sum_{i<j} \lambda_{\{e_i, e_j\}} . \tag{10} $$

While the maximum entropy principle applied to our fundamental
conjecture restricts the form of our model to a finite-dimensional
exponential family, we need to investigate ways of finding the
optimal or - as we will see - an approximately optimal distribution
in this family. To that end, we first re-interpret the obtained
model as a factor graph model.






Figure 2: Proposed factor graph for a document with four mentions.
Each mention node $m_i$ is paired with its corresponding entity
node $E_i$, while all entity nodes are connected through
entity-entity pair factors.


3.3 Markov Network and Factor Graph 

Complementary to the maximum entropy estimation perspective, we
want to present a view on our model in terms of probabilistic
graphical models and factor graphs. Inspecting Eq. (7) and
interpreting $\phi$ and $\psi$ as potential functions, we can
recover a Markov network that makes conditional independence
assumptions of the following type: an entity link $e_i$ and a
mention $m_j$ with $i \neq j$ are independent, given $m_i$ and
$\mathbf{e}_{-i}$, where $\mathbf{e}_{-i}$ denotes the set of
entity variables in the document excluding $e_i$. This means that a
mention $m_j$ only influences a variable $e_i$ through the
intermediate variable $e_j$. However, the functional form in
Eq. (7) goes beyond these conditional independences in that it
limits the order of interaction among the variables. A variable
$e_i$ interacts with neighbors in its Markov blanket through
pairwise potentials. In terms of a factor graph decomposition,
$p(\mathbf{e} \mid \mathbf{m})$ decomposes into functions of two
arguments only, modeling pairwise interactions between entities on
one hand, and between entities and their corresponding mentions on
the other hand.

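We emphasize the factor model view by rewriting (7) as

$$ p(\mathbf{e} \mid \mathbf{m}; \rho, \lambda) \propto \prod_{i} \exp\left[ \rho_{e_i, m_i} \right] \cdot \prod_{i<j} \exp\left[ \lambda_{\{e_i, e_j\}} \right] \tag{11} $$

where we think of $\rho$ and $\lambda$ as functions

$$ \rho : \mathcal{E} \times \mathcal{V} \to \mathbb{R}, \quad (e, m) \mapsto \rho_{e,m} $$

$$ \lambda : \mathcal{E}^2 \to \mathbb{R}, \quad \{e, e'\} \mapsto \lambda_{\{e,e'\}} $$

An example of a factor graph ($n = 4$) is shown in Figure 2. We
will investigate in the sequel how the factor graph structure can
be further exploited.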
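To make the factor decomposition concrete, the sketch below evaluates the unnormalized log-score (one unary factor per mention, one pairwise factor per entity pair) and normalizes it by brute-force enumeration. It is illustrative only: the parameter dictionaries, the candidate lists and all numeric values are assumptions, and the sum is restricted to the listed candidates rather than the full knowledge base. Enumeration grows exponentially with the number of mentions, which is why approximate inference is needed in practice.

```python
import itertools
import math

def log_score(entities, mentions, rho, lam):
    """Unnormalized log-probability: unary plus pairwise log-potentials."""
    unary = sum(rho.get((e, m), 0.0) for e, m in zip(entities, mentions))
    pairwise = sum(lam.get(frozenset((a, b)), 0.0)
                   for a, b in itertools.combinations(entities, 2))
    return unary + pairwise

def exact_posterior(mentions, candidates, rho, lam):
    """Enumerate all joint assignments; only feasible for tiny documents."""
    assignments = list(itertools.product(*(candidates[m] for m in mentions)))
    scores = [log_score(a, mentions, rho, lam) for a in assignments]
    log_z = math.log(sum(math.exp(s) for s in scores))
    return {a: math.exp(s - log_z) for a, s in zip(assignments, scores)}

# Hypothetical toy parameters, chosen by hand for illustration.
mentions = ["Muller", "Germany"]
candidates = {
    "Muller": ["Thomas_Muller", "Gerd_Muller"],
    "Germany": ["Germany", "Germany_national_football_team"],
}
rho = {
    ("Thomas_Muller", "Muller"): -0.5,
    ("Gerd_Muller", "Muller"): -1.0,
    ("Germany", "Germany"): -0.2,
    ("Germany_national_football_team", "Germany"): -1.5,
}
lam = {frozenset(("Thomas_Muller", "Germany_national_football_team")): 2.0}

posterior = exact_posterior(mentions, candidates, rho, lam)
map_assignment = max(posterior, key=posterior.get)
```

In this toy setting the pairwise factor overrides the locally preferred reading of "Germany", which is the collective effect the model is designed to capture.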

3.4 (Pseudo-)Likelihood Maximization 

While the maximum entropy approach directly motivates the
exponential form of Eq. (7) and is amenable to a plausible factor
graph interpretation, it does not by itself suggest an efficient
parameter fitting algorithm. As is known by convex duality, the
optimal parameters can be obtained by maximizing the conditional
likelihood of the model under the data,

$$ \mathcal{L}(\rho, \lambda; \mathcal{D}) = \sum_{d \in \mathcal{D}} \log p(\mathbf{e}^{(d)} \mid \mathbf{m}^{(d)}; \rho, \lambda) . \tag{12} $$

However, specialized algorithms for maximum entropy estimation such
as generalized iterative scaling are known to be slow, whereas
gradient-based methods require the computation of gradients of
$\mathcal{L}$, which involves evaluating expectations with regard
to the model, since

$$ \nabla_\rho \log Z(\mathbf{m}) = E_p \phi(\mathbf{e}, \mathbf{m}) , \qquad \nabla_\lambda \log Z(\mathbf{m}) = E_p \psi(\mathbf{e}) . \tag{13} $$

The exact inference problem of computing these model expectations,
however, is not generally tractable due to the pairwise couplings
through the $\psi$-statistics.

As an alternative to maximizing the likelihood in Eq. (12), we have
investigated an approximation known as pseudo-likelihood
maximization [35, 38]. Its main benefits are low computational
complexity, simplicity and practical success. Switching to the
Markov network view, the pseudo-likelihood estimator predicts each
variable conditioned on the value of all variables in its Markov
blanket. The latter consists of the minimal set of variables that
renders a variable conditionally independent of everything else. In
our case the Markov blanket consists of all variables that share a
factor with a given variable. Consequently, the Markov blanket of
$e_i$ is $\mathcal{N}(e_i) := (m_i, \mathbf{e}_{-i})$. The
posterior is then approximated in the pseudo-likelihood approach
as:

$$ \tilde{p}(\mathbf{e} \mid \mathbf{m}; \rho, \lambda) := \prod_{i=1}^{n} p(e_i \mid \mathcal{N}(e_i); \rho, \lambda) , \tag{14} $$

which results in the tractable log-likelihood function

$$ \tilde{\mathcal{L}}(\rho, \lambda; \mathcal{D}) := \sum_{d \in \mathcal{D}} \sum_{i=1}^{n^{(d)}} \log p(e_i^{(d)} \mid \mathcal{N}(e_i^{(d)}); \rho, \lambda) . \tag{15} $$

Introducing additional $L_2$-norm penalties
$\gamma (\|\lambda\|_2^2 + \|\rho\|_2^2)$ to further regularize
$\tilde{\mathcal{L}}$, we have utilized parallel stochastic
gradient descent (SGD) [28] with sparse updates to learn the
parameters $\rho, \lambda$. From a practical perspective, we only
keep for each token span $m$ parameters $\rho_{e,m}$ for the most
frequently observed entities $e$. Moreover, we only use
$\lambda_{\{e,e'\}}$ for entity pairs $(e, e')$ that co-occurred
together a sufficient number of times in the collection
$\mathcal{D}$.⁵ As we will discuss in more detail in Section 5, our
experimental findings suggest this brute-force learning approach to
be somewhat ineffective, which has motivated us to develop simpler,
yet more effective plug-in estimators as described below.

⁵ For the Wikipedia collection, even after these pruning steps, we
ended up with more than 50 million parameters in total.


3.5 Bethe Approximation

The major computational difficulty with our model lies in the
pairwise couplings between entities and the fact that these
couplings are dense: The Markov dependency graph between different
entity links in a document is always a complete graph. Let us
consider what would happen if the dependency structure were
loop-free, i.e., if it formed a tree. Then we could rewrite the
prior probability in terms of marginal distributions in the
so-called Bethe form. Encoding the tree structure in a symmetric
relation $\mathcal{T}$, we would get

$$ p(\mathbf{e}) = \frac{\prod_{\{i,j\} \in \mathcal{T}} p(e_i, e_j)}{\prod_{i=1}^{n} p(e_i)^{d_i - 1}} , \qquad d_i := |\{ j : \{i, j\} \in \mathcal{T} \}| . \tag{16} $$

The Bethe approximation pursues the idea of using the above
representation as an unnormalized approximation for
$p(\mathbf{e})$, even when the Markov network has cycles. How does
this relate to the exponential form in Eq. (7)? By simple pattern
matching, we see that if we choose

$$ \lambda_{\{e,e'\}} = \log \left( \frac{p(e, e')}{p(e)\, p(e')} \right) , \quad \forall e, e' \in \mathcal{E} \tag{17} $$

we can apply Eq. (16) to get an approximate distribution

$$ p(\mathbf{e}) \propto \frac{\prod_{i<j} p(e_i, e_j)}{\prod_{i} p(e_i)^{n-2}} = \exp\left[ \sum_{i} \log p(e_i) + \sum_{i<j} \lambda_{\{e_i, e_j\}} \right] , \tag{18} $$

where we see the same exponential form in $\lambda$ appearing as in
Eq. (10). We complete this argument by observing that with

$$ \rho_{e,m} = \log p(e) + \log p(m \mid e) \tag{19} $$

we obtain a representation of a joint distribution that exactly
matches the form in Eq. (7).

What have we gained so far? We started from the desire of
constructing a model that would agree with the observed data on the
co-occurrence probabilities of token spans and their linked
entities as well as on the co-link probability of entity pairs
within a document. This has led to the conditional exponential
family in Eq. (7). We have then proposed pseudo-likelihood
maximization as a way to arrive at a tractable learning algorithm
to try to fit the massive amount of parameters $\rho$ and
$\lambda$. Alternatively, we have now seen that a Bethe
approximation of the joint prior $p(\mathbf{e})$ yields a
conditional distribution $p(\mathbf{e} \mid \mathbf{m})$ that (i)
is a member of the same exponential family, (ii) has explicit
formulas for how to choose the parameters from pairwise marginals,
and (iii) would be exact in the case of a dependency tree. We claim
that the benefits of computational simplicity together with the
correctness guarantee for non-dense dependency networks outweigh
the approximation loss, relative to the model with the best
generalization performance within the conditional exponential
family. In order to close the suboptimality gap further, we suggest
some important refinements below.


3.6 Parameter Calibration

With the previous suggestion, one issue comes into play: The total
contribution coming from the pairwise interactions between entities
will scale with $\binom{n}{2}$, while the entity-mention
compatibility contributions will scale with $n$, the total number
of mentions. This is a direct observation of the number of terms
contributing to the sums in (10). However, for practical reasons,
it is somewhat implausible that, as $n$ grows, the prior
$p(\mathbf{e})$ should dominate and the contribution of the
likelihood term should vanish. The model is not well-calibrated
with regard to $n$.

We propose to correct for this effect by adding a normalization
factor to the $\lambda$-parameters by replacing (17) with:

$$ \lambda_{\{e,e'\}} = \frac{2}{n-1} \log \left( \frac{p(e, e')}{p(e) \cdot p(e')} \right) , \quad \forall e, e' \in \mathcal{E} \tag{20} $$

where now these parameters scale inversely with $n$, the number of
entity links in a document, making the corresponding sum in
Eq. (10) scale with $n$. With this simple change, a substantial
accuracy improvement was observed empirically, the details of which
are reported in our experiments.

The re-calibration in Eq. (20) can also be justified by the
following combinatorial argument: For a given set $Y$ of random
variables, define a $Y$-cycle as a graph containing as nodes all
variables in $Y$, each with degree exactly 2, connected in a single
cycle. Let $\Pi$ be the set enumerating all possible $Y$-cycles.
Then, $|\Pi| = (n - 1)!$, where $n$ is the size of $Y$.

In our case, if the entity variables $\mathbf{e}$ per document had
formed a cycle of length $n$ instead of a complete subgraph, the
Bethe approximation would have been written as:

$$ p_\pi(\mathbf{e}) \propto \frac{\prod_{(i,j) \in E(\pi)} p(e_i, e_j)}{\prod_{i} p(e_i)} , \quad \forall \pi \in \Pi \tag{21} $$

where $E(\pi)$ is the set of edges of the $\mathbf{e}$-cycle
$\pi$. However, as we do not desire to further constrain our graph
with additional independence assumptions, we propose to approximate
the joint prior $p(\mathbf{e})$ by the average of the Bethe
approximation of all possible $\pi$, that is

$$ \log p(\mathbf{e}) \approx \frac{1}{|\Pi|} \sum_{\pi \in \Pi} \log p_\pi(\mathbf{e}) . \tag{22} $$

Since each pair $(e_i, e_j)$ would appear in exactly $2(n - 2)!$
$\mathbf{e}$-cycles, one can derive the final approximation:

$$ p(\mathbf{e}) \approx \frac{\prod_{i<j} p(e_i, e_j)^{\frac{2}{n-1}}}{\prod_{i} p(e_i)} \tag{23} $$

Distributing marginal probabilities over the parameters starting
from Eq. (23) and applying a similar argument as in Eq. (18)
results in the assignment given by Eq. (20). While the above line
of argument is not a strict mathematical derivation, we believe
this to shed further light on the empirically observed
effectiveness of the parameter re-scaling.
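The counting step behind the factor $2/(n-1)$ can be checked numerically for small $n$. The sketch below is an illustration, not part of the original method: it enumerates all cycles through $n$ nodes as directed orderings starting at node 0 (which matches the count $(n-1)!$ used above) and counts how many of them contain a given pair as an edge. The helper names and the probabilities passed to `calibrated_lambda` are assumptions.

```python
import math
from itertools import permutations
from math import factorial

def pair_fraction_in_cycles(n, i=0, j=1):
    """Count directed n-cycles (n >= 3) in which nodes i and j are adjacent."""
    total = hits = 0
    # Fix node 0 as the start so each directed cycle is listed exactly once.
    for rest in permutations(range(1, n)):
        cycle = (0,) + rest
        edges = {frozenset((cycle[k], cycle[(k + 1) % n])) for k in range(n)}
        total += 1
        hits += frozenset((i, j)) in edges
    return hits, total

def calibrated_lambda(p_joint, p_e, p_e2, n):
    """Eq. (20): log-ratio coupling scaled by 2 / (n - 1)."""
    return 2.0 / (n - 1) * math.log(p_joint / (p_e * p_e2))

hits, total = pair_fraction_in_cycles(5)
ratio = hits / total  # compare with 2 / (n - 1)
```

For a document with $n$ mentions, the ratio `hits / total` is the weight each pairwise log-marginal receives in the averaged approximation, which is where the scaling in Eq. (20) comes from.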


3.7 Integrating Context 

The model that we have discussed so far does not consider the local
context of a mention. This is a powerful source of information that
a competitive entity linking system should utilize. For example,
words like “computer”, “company” or “device” are more likely to
appear near references of the entity Apple_Inc. than of the entity
Apple_fruit. We demonstrate in this section how this integration
can be easily done in a principled way on top of the current
probabilistic model. This showcases the extensibility of our
approach. Enhancing our model with additional knowledge such as
entity categories or word co-reference can also be done in a
rigorous way, so we hope that this provides a template for future
extensions.

As stated in Section 3.1, for each mention $m_i$ in a document, we
maintain a context representation $c_i$ consisting of the bag of
words surrounding the mention within a window of length $K$.⁶
Hence, $c_i$ can be viewed as an additional random variable with an
observed outcome. At this stage, we make additional reasonable
independence assumptions that increase tractability of our model.
First, we assume that, knowing the identity of the linked entity
$e_i$, the mention token span $m_i$ is just the surface form of the
entity, so it brings no additional information for the generative
process describing the surrounding context $c_i$. Formally, this
means that $m_i$ and $c_i$ are conditionally independent given
$e_i$. Consequently, we obtain a factorial expression for the joint
model

$$ p(\mathbf{e}, \mathbf{m}, \mathbf{c}) = p(\mathbf{e})\, p(\mathbf{m}, \mathbf{c} \mid \mathbf{e}) = p(\mathbf{e}) \prod_{i=1}^{n} p(m_i \mid e_i)\, p(c_i \mid e_i) \tag{24} $$

⁶ Throughout our experiments, we used a context window of size
$K = 100$, intuitively chosen and without extensive validation.

This is a simple extension of the previous factor graph that
includes context variables. Second, we assume conditional
independence of the words in $c_i$ given an entity $e_i$, which
lets us factorize the context probabilities as

$$ p(c_i \mid e_i) = \prod_{w_j \in c_i} p(w_j \mid e_i) \tag{25} $$

Note that this assumption is commonly made in models using
bag-of-words representations or naive Bayes classifiers.

While this completes the argument from a joint model point of view,
we need to consider one more aspect for the conditional
distribution $p(\mathbf{e} \mid \mathbf{m}, \mathbf{c})$ that we
are interested in. If we cannot afford (computationally as well as
with regard to training data size) a full-blown discriminative
learning approach, then how do we balance the relative influence of
the context $c_i$ and the mention token span $m_i$ on $e_i$? For
instance, the effect of $c_i$ will depend on the chosen window size
$K$, which is not realistic.

To address this issue, we resort to a hybrid approach, where, in
the spirit of the Bethe approximation, we continue to express our
model in terms of simple marginal distributions that can be easily
estimated independently from data, yet that allow for a small
number of parameters (in our case “small” equals 2) to be chosen to
optimize the conditional log-likelihood
$p(\mathbf{e} \mid \mathbf{m}, \mathbf{c})$. We thus introduce
weights $\zeta$ and $\tau$ that control the importance of the
context factors and, respectively, of the entity-entity interaction
factors. Putting equations (19), (20), (24) and (25) together, we
arrive at the final model that will be subsequently referred to as
the PBoH model (Probabilistic Bag of Hyperlinks):

$$ \log p(\mathbf{e} \mid \mathbf{m}, \mathbf{c}) = \sum_{i=1}^{n} \left( \log p(e_i \mid m_i) + \zeta \sum_{w_j \in c_i} \log p(w_j \mid e_i) \right) + \frac{2\tau}{n-1} \sum_{i<j} \log \left( \frac{p(e_i, e_j)}{p(e_i)\, p(e_j)} \right) + \text{const.} \tag{26} $$

In Eq. (26) we used the identity
$p(m \mid e)\, p(e) = p(e \mid m)\, p(m)$ and absorbed all
$\log p(m)$ terms in the constant. We use grid search on a
validation set for the remaining problem of optimizing over the
parameters $\zeta$ and $\tau$. Details are provided in Section 5.

3.8 Smoothing Empirical Probabilities

In order to estimate the probabilities involved in Eq. (26), we
rely on an entity-annotated corpus of text documents, e.g.,
Wikipedia Web pages together with their hyperlinks, which we view
as ground truth annotations. From this corpus, we derive empirical
probabilities for a name-to-entity dictionary $p(m \mid e)$ based
on counting how many times an entity appeared referenced by a given
name.⁷ We also compute the pairwise probabilities $p(e, e')$
obtained by counting the pairwise co-occurrence of entities $e$ and
$e'$ within the same document. Similarly, we obtained empirical
values for the marginals $p(e) = \sum_{e'} p(e, e')$ and for the
context word-entity statistics $p(w \mid e)$.

In the absence of huge amounts of data, estimating such
probabilities from counts is subject to sparsity. For instance, in
our statistics, there are 8 times more distinct pairs of entities
that co-occur in at most 3 Wikipedia documents compared to the
total number of distinct pairs of entities that appear together in
at least 4 documents. Thus, it is expected that the heavy tail of
infrequent pairs of entities will have a strong impact on the
accuracy of our system.

Traditionally, various smoothing techniques are employed to address
sparsity issues arising commonly in areas such as natural language
processing. Out of the wealth of methods, we decided to use the
absolute discounting smoothing technique, which involves
interpolation of higher and lower order (backoff) models. In our
case, whenever insufficient data is available for a pair of
entities $(e, e')$, we assume the two entities are drawn from
independent distributions. Thus, if we denote by $N(e, e')$ the
total number of corpus documents that link both $e$ and $e'$, and
by $N_{ep}$ the total number of pairs of entities referenced in
each document, then the final formula for the smoothed entity
pairwise probabilities is:

$$ \tilde{p}(e, e') = \frac{\max(N(e, e') - \delta, 0)}{N_{ep}} + (1 - \mu_e)\, p(e)\, p(e') \tag{27} $$

where $\delta \in [0, 1]$ is a fixed discount and $\mu_e$ is a
constant that assures that
$\sum_{e} \sum_{e'} \tilde{p}(e, e') = 1$. $\delta$ was set by
performing a coarse grid search on a validation set. The best
$\delta$ value was found to be 0.5.

The word-entity empirical probabilities $p(w \mid e)$ were computed
based on the Wikipedia corpus by counting the frequency with which
word $w$ appears in the context windows of size $K$ around the
hyperlinks pointing to $e$. In order to avoid memory explosion, we
only considered the entity-word pairs for which these counts are at
least 3. These empirical estimates are also sparse, so we used
absolute discounting smoothing for their correction by backing off
to the unbiased estimates $p(w)$. The latter can be much more
accurately estimated from any text corpus. Finally, we obtain:

$$ \tilde{p}(w \mid e) = \frac{\max(N(w, e) - \xi, 0)}{N_{wp}} + (1 - \mu_w)\, p(w) . \tag{28} $$

Again, $\xi \in [0, 1]$ was optimized by grid search to be 0.5.


4. INFERENCE


After introducing our model and showing how to train it 
in the previous section, we now explain the inference process 
used for prediction. 


4.1 Candidate Selection 


Dataset        # non-NIL mentions   # documents
AIDA test A    4791                 216
AIDA test B    4485                 231
MSNBC          656                  20
AQUAINT        727                  50
ACE04          257                  35

Table 1: Statistics on some of the used datasets

At test time, for each mention to be disambiguated, we
first select a set of potential candidates by considering the
top R ranked entities based on the local mention-entity
probability dictionary p(e|m) (in our implementation, we summed
the mention-entity counts from Wikipedia hyperlinks with the
Crosswikis counts). We found R = 64 to be a good compromise
between efficiency and accuracy loss. Second, we want to keep
the average number of candidates per mention as small as possible
in order to reduce the running time, which is quadratic in this
number (see the next section for details). Consequently, we
further limit the number of candidates per mention by keeping
only the top 10 entity candidates re-ranked by the local
mention-context-entity compatibility defined as

$$\log p(e_i \mid m_i, c_i) = \log p(e_i \mid m_i) + \zeta \sum_{w_j \in c_i} \log p(w_j \mid e_i) + \mathrm{const.} \qquad (29)$$

These pruning heuristics result in a significantly improved 
running time at an insignificant accuracy loss. 
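
The two pruning steps can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' code: the function name, the nested-dictionary layout of the two probability tables and the `floor` value used for unseen word-entity pairs are hypothetical (in the paper, smoothing already guarantees non-zero word probabilities).

```python
import math


def select_candidates(mention, context_words, p_e_given_m, p_w_given_e,
                      zeta=0.075, top_r=64, top_k=10, floor=1e-9):
    """Return at most top_k candidate entities for one mention (sketch)."""
    prior = p_e_given_m.get(mention, {})
    # Step 1: keep the top_r entities by the mention-entity prior p(e|m).
    shortlist = sorted(prior, key=prior.get, reverse=True)[:top_r]

    # Step 2: re-rank by the local mention-context-entity score.
    def local_score(entity):
        word_probs = p_w_given_e.get(entity, {})
        # `floor` stands in for smoothed probabilities (an assumption).
        context = sum(math.log(word_probs.get(w, floor))
                      for w in context_words)
        return math.log(prior[entity]) + zeta * context

    return sorted(shortlist, key=local_score, reverse=True)[:top_k]


# Toy tables (hypothetical values).
p_e_given_m = {"Paris": {"Paris_France": 0.8,
                         "Paris_Hilton": 0.15,
                         "Paris_Texas": 0.05}}
p_w_given_e = {"Paris_France": {"city": 0.1},
               "Paris_Hilton": {"hotel": 0.2},
               "Paris_Texas": {"city": 0.05}}
top2 = select_candidates("Paris", ["city"], p_e_given_m, p_w_given_e, top_k=2)
```

In this toy case the context word "city" moves Paris_Texas ahead of Paris_Hilton even though its prior is lower, which is the effect the re-ranking step is meant to capture.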

If the given mention is not found in our map p(e|m), we 
try to replace it by the closest name in this dictionary. Such 
a name is picked only if the Jaccard distance between the 
set of letter trigrams of these two strings is smaller than a 
threshold that we empirically picked as 0.5. Otherwise, the 
mention is not linked at all. 
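
A minimal Python sketch of this fallback is given below. The helper names are hypothetical, and the choices to compare strings case-sensitively and to scan the whole dictionary linearly are simplifying assumptions; the source only specifies letter trigrams, the Jaccard distance and the 0.5 threshold.

```python
def letter_trigrams(name):
    """Set of all contiguous three-character substrings of `name`."""
    return {name[i:i + 3] for i in range(len(name) - 2)}


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; defined as 1.0 when both sets are empty."""
    union = a | b
    if not union:
        return 1.0
    return 1.0 - len(a & b) / len(union)


def closest_known_name(mention, known_names, threshold=0.5):
    """Return the nearest dictionary name, or None if none is close enough."""
    target = letter_trigrams(mention)
    best_name, best_dist = None, threshold
    for name in known_names:
        dist = jaccard_distance(target, letter_trigrams(name))
        if dist < best_dist:  # strictly smaller than the threshold
            best_name, best_dist = name, dist
    return best_name


names = ["Barack Obama", "Michelle Obama"]
match = closest_known_name("Barak Obama", names)
```

For the misspelled mention "Barak Obama", the two strings share 7 of 12 distinct trigrams, giving a distance of about 0.42, so the mention is mapped to "Barack Obama"; a string with no overlapping trigrams returns `None` and stays unlinked.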


In the end, the final entity assignment is determined by:

$$e_i^* = \arg\max_{e_i} \Big( \rho_{m_i,e_i} + \sum_{1 \le k \le n} m^{t}_{E_k \to E_i}(e_i) \Big) \qquad (32)$$

The complexity of the belief propagation algorithm is, in
our case, $O(n^2 \cdot r^2)$, with n being the number of mentions
in a document and r being the average number of candidate
entities per mention (10 in our case). More details regarding
the run-time and convergence of the loopy BP algorithm can
be found in Section 5.


4.2 Belief Propagation 

Collectively disambiguating all mentions in a text involves
iterating through an exponential number of possible entity
resolutions. Exact inference in general graphical models is
NP-hard, therefore approximations are employed. We propose
solving the inference problem through the loopy belief
propagation (LBP) technique, using the max-product algorithm
that approximates the MAP solution in a run-time polynomial in n,
the number of input mentions. For the sake of brevity, we only
present the algorithm for the maximum entropy model; a similar
approach was used for the enhanced PBoH model given by Eq. (26).

Our proposed graphical model is a fully connected graph
where each node corresponds to an entity random variable.
Unary potentials $\exp(\rho_{m,e})$ model the entity-mention
compatibility, while pairwise potentials $\exp(\lambda_{e,e'})$
express entity-entity correlations. For the posterior of this
model, one can derive the update equation of the logarithmic
message that is sent in round t + 1 from entity random variable
$E_i$ to the outcome $e_j$ of the entity random variable $E_j$:

$$m^{t+1}_{E_i \to E_j}(e_j) = \max_{e_i} \Big( \rho_{m_i,e_i} + \lambda_{e_i,e_j} + \sum_{1 \le k \le n;\ k \ne j} m^{t}_{E_k \to E_i}(e_i) \Big) \qquad (30)$$


Note that, for simplicity, we skip the factor graph framework
and send messages directly between each pair of entity
variables. This is equivalent to the original BP framework.

We chose to update messages synchronously: in each round
t, every pair of entity nodes $E_i$ and $E_j$ exchanges messages.
This is done until convergence or until an allowed maximum number
of iterations (15 in our experiments) is reached. The
convergence criterion is:


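$$\max_{1 \le i, j \le n;\ e_j} \left| m^{t+1}_{E_i \to E_j}(e_j) - m^{t}_{E_i \to E_j}(e_j) \right| \le \varepsilon \qquad (31)$$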


where e = 10 ®. This setting was sufficient in most of the 
cases to reach convergence. 
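
The synchronous max-product scheme described in this section can be sketched in Python as below. This is a simplified illustration, not the PBoH implementation. Assumptions that do not come from the source: the unary scores are passed as one dictionary per mention, the pairwise score is a symmetric function, the default tolerance `tol` is an arbitrary illustrative value, and each message is shifted so that its largest entry is zero (a common normalization that keeps log-messages bounded on loopy graphs).

```python
def loopy_max_product(unary, pairwise, max_rounds=15, tol=1e-5):
    """Approximate MAP assignment on a fully connected pairwise model.

    unary:    list of {entity: log-score}, one dict per mention
    pairwise: function (e_i, e_j) -> log-score, assumed symmetric
    """
    n = len(unary)
    edges = [(i, j) for i in range(n) for j in range(n) if i != j]
    # msgs[(i, j)][e_j]: log-message from variable i about value e_j of j.
    msgs = {(i, j): {e: 0.0 for e in unary[j]} for i, j in edges}

    for _ in range(max_rounds):
        new_msgs = {}
        for i, j in edges:
            # Messages flowing into i from every neighbour except j.
            incoming = {
                e_i: sum(msgs[(k, i)][e_i]
                         for k in range(n) if k not in (i, j))
                for e_i in unary[i]
            }
            raw = {
                e_j: max(unary[i][e_i] + pairwise(e_i, e_j) + incoming[e_i]
                         for e_i in unary[i])
                for e_j in unary[j]
            }
            shift = max(raw.values())  # normalization (assumption)
            new_msgs[(i, j)] = {e_j: v - shift for e_j, v in raw.items()}
        # Largest absolute change of any message entry in this round.
        change = max((abs(new_msgs[edge][e] - msgs[edge][e])
                      for edge in edges for e in msgs[edge]), default=0.0)
        msgs = new_msgs
        if change <= tol:
            break

    assignment = []
    for i in range(n):
        belief = {e: unary[i][e] + sum(msgs[(k, i)][e]
                                       for k in range(n) if k != i)
                  for e in unary[i]}
        assignment.append(max(belief, key=belief.get))
    return assignment


# Toy example (hypothetical scores): the pairwise term overrides the
# locally preferred candidate A1 of the first mention.
bonus = {frozenset(("A2", "B1")): 2.0}
pair_score = lambda a, b: bonus.get(frozenset((a, b)), 0.0)
result = loopy_max_product([{"A1": -0.5, "A2": -1.0}, {"B1": 0.0}], pair_score)
```

Each round touches every ordered pair of mentions and, for each pair, every combination of candidates, which is where the quadratic dependence on both n and r comes from.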


5. EXPERIMENTS 

We now present the experimental evaluation of our method.
We first give some practical details of our approach. Further,
we show an empirical comparison between PBoH and well-known or
recent competitive entity disambiguation systems. We use the
Gerbil testing platform version 1.1.4 with the D2KB setting, in
which a document together with a fixed set of mentions to be
annotated are given as input. We run additional experiments that
allow us to compare against more recent approaches, such as [16]
and [11].

Note that in all the experiments we assume that we have
access to a set of linkable token spans for each document. In
practice this set is obtained by first applying a mention
detection approach, which is not part of our method. Our main
goal is then to annotate each token span with a Wikipedia
entity.

Evaluation metrics. We quantify the quality of an entity
linking system by measuring common metrics such as precision,
recall and F1 scores.

Let M* be the ground truth entity annotations associated
with a given set of mentions X. Note that in all the results
reported, mentions that contain NIL or empty ground truth
entities are discarded before the evaluation; the same decision
is taken in Gerbil version 1.1.4. Let M be the output
annotations of an entity disambiguation system on the same
input. Then, our quality metrics are computed as follows:

• Precision: $P = |M \cap M^*| / |M|$

• Recall: $R = |M \cap M^*| / |M^*|$

• F1 score: $F_1 = 2 P R / (P + R)$
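
The definitions above, together with micro-averaging (pooling all mentions) and macro-averaging (averaging per-document scores), can be sketched in Python. The function names and the representation of annotations as sets of (mention id, entity) pairs are assumptions made for illustration; they are not taken from Gerbil or from the authors' code.

```python
def prf1(predicted, gold):
    """Precision, recall and F1 for two sets of (mention_id, entity) pairs."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def micro_macro_f1(docs):
    """docs: list of (predicted, gold) set pairs, one per document."""
    # Micro: pool annotations of all documents (tagged by document index).
    pooled_pred = {(d, a) for d, (pred, _) in enumerate(docs) for a in pred}
    pooled_gold = {(d, a) for d, (_, gold) in enumerate(docs) for a in gold}
    micro = prf1(pooled_pred, pooled_gold)[2]
    # Macro: average the per-document F1 scores.
    macro = sum(prf1(pred, gold)[2] for pred, gold in docs) / len(docs)
    return micro, macro


# Toy example: the system leaves one mention of document 1 unannotated,
# so precision (2/3) and recall (2/4) differ.
doc1 = ({(0, "A"), (1, "B"), (2, "X")},
        {(0, "A"), (1, "B"), (2, "C"), (3, "D")})
doc2 = ({(0, "E")}, {(0, "E")})
micro, macro = micro_macro_f1([doc1, doc2])
```

Because short documents weigh as much as long ones in the macro average, the two aggregates can differ noticeably on collections with uneven document lengths.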

We mostly report results in terms of F1 scores, namely
macro-averaged F1@MA (aggregated across documents)
and micro-averaged F1@MI (aggregated across mentions).
For a fair comparison with Houlsby and Ciaramita [16], we
also report micro-recall R@MI and macro-recall R@MA
on the AIDA datasets.

Note that, in our case, the precision and recall are not
necessarily identical since a method may not consider
annotating certain mentions.

Footnote: In PBoH, we refrain from annotating mentions for which
no candidate entity is found according to the procedure
described in Section 4.1.

System, followed by its 28 scores in source order (micro and macro F1 for
the 14 datasets; the dataset column headers are not preserved):

AGDISTIS             65.83 77.63 60.27 56.97 59.06 53.36 58.32 58.03 61.05 57.53 60.10 58.62 36.61 33.25 41.23 43.38 34.16 30.20 42.43 61.08 50.39 62.87 75.42 73.82 67.95 75.52 59.88 70.80
Babelfy              63.20 76.71 78.00 73.81 75.77 71.26 80.36 74.52 78.01 74.22 72.27 73.23 51.05 51.97 57.13 55.36 73.12 69.77 47.20 62.11 50.60 61.02 78.17 75.73 58.61 59.87 69.17 76.00
DBpedia Spotlight    70.38 80.02 58.84 60.59 54.90 54.11 57.69 61.34 60.04 62.23 74.03 73.13 69.27 67.23 65.44 62.81 37.59 32.90 56.43 71.63 56.26 67.99 69.27 69.82 56.44 58.77 57.63 65.03
Dexter               18.72 16.97 48.46 45.29 45.44 42.17 48.59 46.20 49.25 45.85 38.28 38.15 26.70 22.75 28.53 28.48 17.20 12.54 31.27 44.02 35.21 42.07 36.86 39.42 32.74 31.85 31.11 33.55
Entityclassifier.eu  12.74 12.3 46.6 42.86 44.13 42.36 44.02 41.31 47.83 43.36 21.67 19.59 22.59 18.0 18.46 19.54 27.97 25.2 29.12 39.53 32.69 38.41 41.24 40.3 28.4 24.84 21.77 22.2
Kea                  80.08 87.57 73.39 73.26 70.9 67.91 72.64 73.31 74.22 74.47 81.84 81.27 73.63 76.60 72.03 70.52 57.95 53.17 63.4 76.54 64.67 74.32 85.49 87.4 63.2 64.45 69.29 75.93
NERD-ML              54.89 72.22 54.62 52.35 52.85 49.6 52.59 51.34 55.55 53.23 49.68 46.06 46.8 45.59 51.08 49.91 29.96 24.75 38.65 57.91 39.83 53.74 64.03 67.28 54.96 62.9 61.22 67.3
TagMe 2              81.93 89.09 72.07 71.19 69.07 66.5 70.62 70.38 73.2 72.45 76.27 75.12 63.31 65.1 57.23 55.8 57.34 54.67 56.81 71.66 59.14 70.45 75.96 77.05 59.32 67.55 78.05 83.2
WAT                  80.0 86.49 83.82 83.59 81.82 80.25 84.34 84.12 84.21 84.22 76.82 77.64 65.18 68.24 61.14 59.36 58.99 53.13 59.56 73.89 61.96 72.65 77.72 79.08 64.38 65.81 68.21 76.0
Wikipedia Miner      77.14 86.36 64.72 66.17 61.65 61.67 60.71 63.19 66.48 67.93 75.96 74.63 62.57 61.43 58.59 56.98 41.63 35.0 54.88 69.29 55.93 67.0 64.25 64.68 60.05 66.51 64.54 72.23
PBoH                 87.19 90.40 86.72 86.85 86.63 85.48 87.39 86.32 86.59 87.30 86.64 86.14 79.48 80.13 62.47 61.04 61.70 55.83 74.19 84.48 73.08 81.25 89.54 89.62 76.54 83.31 71.24 78.33

Table 2: Micro and macro F1 scores reported by Gerbil for 14 datasets and 11 entity linking systems including
PBoH. For each dataset and each metric, we highlight in red the best system and in blue the second best
system.

                 AIDA test A       AIDA test B
Systems          R@MI    R@MA      R@MI    R@MA
LocalMention     69.73   69.30     67.98   72.75
TagMe reimpl.    76.89   74.57     78.64   78.21
AIDA             79.29   77.00     82.54   81.66
S & Y            -       84.22     -       -
Houlsby et al.   79.65   76.61     84.89   83.51
PBoH             85.70   85.26     87.61   86.44

Table 3: AIDA test-a and AIDA test-b datasets results.

Pseudo-likelihood training. We briefly mention some of
the practical issues that we encountered with the likelihood
maximization described in Section 3.4. From a practical
perspective, for each mention m, we only considered the set
of parameters $\rho_{m,e}$ limited to the top 64 candidate
entities e per mention, ranked by $\hat{p}(e|m)$. Additionally,
we restricted the set of $\lambda_{e,e'}$ parameters to entity
pairs (e, e') that co-occurred in at least 7 documents throughout
the Wikipedia corpus. In total, a set of 26 million $\rho$ and
39 million $\lambda$ parameters were learned using the previously
described procedure. Note that the universe of all Wikipedia
entities is of size approximately 4 million.

For the SGD procedure, we tried different initializations
of these parameters, including $\rho_{m,e} = \log \hat{p}(e|m)$,
$\lambda_{e,e'} = 0$, as well as the parameters given by Eq. (17).
However, in all cases, the accuracy gain on a sample of 1000
Wikipedia test pages was small or negligible compared to the
LocalMention baseline (described below). One reason is the
inherent sparsity of the data: the parameters associated with the
long tail of infrequent entity pairs are updated rarely and are
expected to be defective at the end of the SGD procedure. However,
these scattered pairs are crucial for the effectiveness and
coverage of the entity disambiguation system. To overcome this
problem, we refined our model as described in Section 3.5
and subsequent sections.

PBoH training details. Wikipedia itself is a valuable resource
for entity linking since each internal hyperlink can be
considered as the ground truth annotation for the respective
anchor text. In our system, the training is done solely on
the entire Wikipedia corpus. Hyper-parameters are grid-searched
such that the sum of the micro F1 and macro F1 scores is
maximized over the combined held-out set containing only
the AIDA Test-A dataset and a Wikipedia validation set
consisting of 1000 random pages. As a preprocessing step
in our training procedure, we removed all annotations and
hyperlinks that point to non-existing, disambiguation or list
Wikipedia pages.

The PBoH system used in the experimental comparison
is the model given by Eq. (26), for which grid search of the
hyper-parameters suggested using $\zeta = 0.075$, $\tau = 0.5$,
$\delta = 0.5$, $\xi = 0.5$.

Footnote: We used the Wikipedia dump from February 2014.

               new MSNBC         new AQUAINT       new ACE2004
Systems        F1@MI   F1@MA     F1@MI   F1@MA     F1@MI   F1@MA
LocalMention   73.64   77.71     87.33   86.80     84.75   85.70
Cucerzan       88.34   87.76     78.67   78.22     79.30   78.22
M & W          78.43   80.37     85.13   84.84     81.29   84.25
Han et al.     88.46   87.93     79.46   78.80     73.48   66.80
AIDA           78.81   76.26     56.47   56.46     80.49   84.13
GLOW           75.37   77.33     83.14   82.97     81.91   83.18
RI             90.22   90.87     87.72   87.74     86.60   87.13
REL-RW         91.37   91.73     90.74   90.58     87.68   89.23
PBoH           91.06   91.19     89.27   88.94     88.71   88.46

Table 4: Results on the newer versions of the MSNBC, AQUAINT and ACE04 datasets.


Datasets. We evaluate our approach on 14 well-known public
entity linking datasets built from various sources. Statistics
of some of them are shown in Table 1 and their descriptions
are provided below. For information on the other
datasets used only in the Gerbil experiments, refer to [37].

• The CoNLL-AIDA dataset is an entity-annotated corpus
of Reuters news documents introduced by Hoffart et al.
It is much larger than most of the other existing EL
datasets, making it an excellent evaluation target. The
data is divided into three parts: Train (not used in our
current setting for training, but only in the Gerbil
evaluation), Test-A (used for validation) and Test-B
(used for blind evaluation). Similar to Houlsby and
Ciaramita [16] and others, we also report results on
the validation set Test-A.

• The AQUAINT dataset introduced by Milne and Witten
contains documents from a news corpus from
the Xinhua News Service, the New York Times and
the Associated Press.

• MSNBC - a dataset of news documents that includes
many mentions which do not easily map to
Wikipedia titles because of their rare surface forms or
distinctive lexicalization.

• The ACE04 dataset [27] is a subset of ACE2004
Coreference documents annotated using Amazon Mechanical
Turk. Note that the ACE04 dataset contains mentions
that are annotated with NIL entities, meaning
that no proper Wikipedia entity was found. Following
common practice, we removed all the mentions
corresponding to these NIL entities prior to our evaluation.

Note that the Gerbil platform uses an old version of the
AQUAINT, MSNBC and ACE04 datasets that contain some
no-longer existing Wikipedia entities. A new cleaned version
of these sets was released by Guo & Barbosa [11]. We
report results for the new cleaned datasets in Table 4, while
Table 2 contains results for the old versions currently used
by Gerbil.

Footnote: http://www.cs.ualberta.ca/~denilson/data/deos14_ualberta_experiments.tgz



Datasets                        AIDA test A   AIDA test B   MSNBC    AQUAINT   ACE04
Avg. num. mentions per doc      22.18         19.41         32.8     14.54     7.34
Conv. rate                      100%          99.56%        100%     100%      100%
Avg. running time (ms/doc)      445.56        203.66        371.65   40.42     10.88
Avg. num. rounds                2.86          2.83          3.0      2.56      2.25

Table 5: Loopy belief propagation statistics. Average
running time, number of rounds and convergence
rate of our inference procedure are provided.


Systems. For comparison, we selected a broad range of competitor
systems from the vast literature in this field. The
Gerbil platform already integrates the methods of AGDISTIS [36],
Babelfy [23], DBpedia Spotlight [20], Dexter, Kea [33],
NERD-ML [29], TagMe 2, WAT, Wikipedia Miner and Illinois
Wikifier [27]. We furthermore compare against Cucerzan - the
first collective EL system that uses optimization techniques,
M & W - a popular machine learning approach, Han et al. [13] -
a graph-based disambiguation system that uses random walks for
joint disambiguation, AIDA - a performant graph-based approach,
GLOW - a system that uses local and global context to perform
joint entity disambiguation, RI - an approach using relational
inference for mention disambiguation, and REL-RW [11] - a recent
system that iteratively solves mentions relying on an online
updating random walk model. In addition, on the AIDA datasets we
also compare against S & Y - an apparatus for combining the NER
and EL tasks, and Houlsby et al. [16] - a topic modelling
LDA-based approach for EL.

To empirically assess the accuracy gain introduced by each
incremental step of our approach, we ran experiments on
several of our method's components individually: LocalMention -
links mentions to entities solely based on the token span
statistics, i.e., $e^* = \arg\max_e p(e|m)$; Unnorm - uses the
unnormalized mention-entity model described in Section 3.5;
Rescaled - relies on the rescaled model presented in Section 3.6;
LocalContext - disambiguates an entity based on the mention and
the local context probability given by Equation (29), i.e.,
$e^* = \arg\max_e p(e|m, c)$. Note that Unnorm, Rescaled and
PBoH use the loopy belief propagation procedure for inference.




















































Datasets                   MSNBC        AQUAINT      ACE2004
Avg. # mentions per doc    36.95        14.54        8.68

Systems                    # entities   # entities   # entities
PBoH                       247.19       95.38        66.66
REL-RW                     382773.6     242443.1     256235.49

Table 6: Average number of entities that appear in
the graph built by PBoH and by REL-RW

5.1 Results 

Results of the experiments run on the Gerbil platform are
shown in Table 2. Detailed results are also provided online.
We obtain the highest performance on 11 datasets and the
second highest performance on 2 datasets, showing the
effectiveness of our method.

Other results are presented in Table 3 and Table 4. The
highest accuracy for the cleaned version of AQUAINT, MSNBC
and ACE04 was previously reported by Guo & Barbosa [11],
while Houlsby et al. [16] dominate the AIDA datasets. Note
that the performance of the baseline systems shown in these
two tables is taken from [11] and [16].

All these methods are tested in the setting where a fixed
set of mentions is given as input, without requiring the
mention detection step.

Discussion. Several observations are worth noting here.
First, the simple LocalMention component alone outperforms
many EL systems. However, our experimental results
show that PBoH consistently beats LocalMention on all the
datasets. Second, PBoH produces state-of-the-art results
on both the development (Test-A) and blind evaluation (Test-B)
parts of the AIDA dataset. Third, on the AQUAINT,
MSNBC and ACE04 datasets, PBoH outperforms all but
one of the presented EL systems and is competitive with the
state-of-the-art approaches. The method whose performance is
closest to ours is REL-RW, whose average F1 score is only
slightly higher than ours (+0.6 on average). However, there
are significant advantages of our method that make it easier
to use for practitioners. First, our approach is conceptually
simpler and only requires sufficient statistics computed from
Wikipedia. Second, PBoH shows a superior computational
complexity manifested in significantly lower run times
(Table 5), making it a good fit for large-scale real-time entity
linking systems; this is not the case for REL-RW, qualified
as "time consuming" by its authors. Third, the number of
entities in the underlying graph, and thus the required memory,
is significantly lower for PBoH (see statistics provided
in Table 6).

Incremental accuracy gains

To give further insight into our method, Table 7 provides
an overview of the contribution brought step by step by
each incremental component of the full PBoH system. It
can be noted that PBoH performs best, outranking all its
individual components.

Footnote: The PBoH Gerbil experiment is available at
http://gerbil.aksw.org/gerbil/experiment?id=201510160025

Footnote: The detailed Gerbil results of the baseline systems
can be accessed at
http://gerbil.aksw.org/gerbil/experiment?id=201510160026



                 AIDA test A       AIDA test B
Systems          R@MI    R@MA      R@MI    R@MA
LocalMention     69.73   69.30     67.98   72.75
Unnorm           69.77   69.95     75.87   75.12
Rescaled         75.09   74.25     74.76   78.28
LocalContext     82.50   81.56     85.46   84.08
PBoH             85.53   85.09     87.51   86.39

Table 7: Accuracy gains of individual PBoH components.

Reproducibility of the experiments

Our experiments are easily reproducible using the details
provided in this paper. Our learning procedure is only based
on statistics coming from the set of Wikipedia webpages. As
a consequence, one can implement a real-time, highly accurate
entity disambiguation system solely based on the details
described in this paper.

Our code is publicly available at:
https://github.com/dalab/pboh-entity-linking

6. CONCLUSION 

In this paper, we described a light-weight graphical model
for entity linking via approximate inference. Our method
employs simple sufficient statistics that rely on three sources
of information: first, a probabilistic name-to-entity map
p(e|m) derived from a large corpus of hyperlinks; second,
observational data about the pairwise co-occurrence of
entities within documents from a Web collection; third,
statistics on entities and their contextual words. Our
experiments, based on a number of popular entity linking
benchmarking collections, show improved performance as compared
to several well-known or recent systems.

There are several promising directions of future work.
Currently, our model considers only pairwise potentials. In
the future, it would be interesting to investigate the use
of higher-order potentials and submodular optimization in
an entity linking pipeline, thus allowing us to capture the
interplay between entire groups of entity candidates (e.g.,
through the use of entity categories). Additionally, we will
further enrich our probabilistic model with statistics from
new sources of information. We expect some of the performance
gains that other papers report from using entity
categories or semantic relations to be additive with regard
to our system's current accuracy.